bedricks@ohsu.edu
 hillali@ohsu.edu
 OHSU Gaines Hall 19 & 21 (we're neighbors)
For this workshop, Steven and I started with a whiteboard session where we came up with a bunch of ideas for exploring word use in these presidential debate transcripts. After much discussion, coffee, and lots of "it would be so cool if we could..." ideas, we settled on these questions to start our exploratory data analysis (EDA):
We'll use a number of R packages for this workshop (remember: install once per machine, load once per work session). You'll need all the following packages installed first. Here is a function for doing this in one fell swoop:
ipak <- function(pkg) {
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}pkgs <- c("readr", "dplyr", "tidyr", "stringr", "lubridate", "ggplot2", "wordcloud",
"devtools", "RColorBrewer", "dygraphs", "xts") # list packages needed
ipak(pkgs) # take function, and give it that listNow we'll need to load 2 packages that are available only on Github, so we'll use the devtools package to run the function install_github to install then load those two packages. Note that the devtools:: command is not needed here because we already loaded that package above. But this command is helpful if you just want to use a single function from a package like we are doing here.
devtools::install_github("jbkunst/d3wordcloud")
library(d3wordcloud)
devtools::install_github("dill/beyonce")
library(beyonce)We'll use the readr package to read in our master_counts.csv file.
master_counts <- read_csv("./data/master_counts.csv", col_names = FALSE)Now we have created an R object called master_counts, and it contains all the data stored in the csv file we made in Python. Let's find out what kind of object we just made- fingers are crossed that it is a dataframe, which makes life easier in R...
str(master_counts) # str here stands for structureClasses 'tbl_df', 'tbl' and 'data.frame': 63763 obs. of 6 variables:
$ X1: chr "04 Feb 2016" "04 Feb 2016" "04 Feb 2016" "04 Feb 2016" ...
$ X2: chr "democrats" "democrats" "democrats" "democrats" ...
$ X3: chr "moderator" "candidate" "candidate" "candidate" ...
$ X4: chr "TODD" "SANDERS" "CLINTON" "SANDERS" ...
$ X5: chr "campaign" "results" "endorse" "california" ...
$ X6: int 2 1 1 1 1 1 3 1 2 2 ...
- attr(*, "spec")=List of 2
..$ cols :List of 6
.. ..$ X1: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ X2: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ X3: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ X4: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ X5: list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ X6: list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
Neat, data.frame is one of the classes, and you can see that we have 6 variables. Since we have a dataframe, we can re-name our 6 variables with more sensible names than X1, X2, etc. by using names().
names(master_counts) <- c("date", "party", "speaker_type", "speaker", "word",
"counts")Did that work? We'll use dplyr::glimpse instead of str to do a quick sanity-check; compare the outputs for yourself.
glimpse(master_counts) # dplyr verb for dataframesObservations: 63,763
Variables: 6
$ date <chr> "04 Feb 2016", "04 Feb 2016", "04 Feb 2016", "04 ...
$ party <chr> "democrats", "democrats", "democrats", "democrats...
$ speaker_type <chr> "moderator", "candidate", "candidate", "candidate...
$ speaker <chr> "TODD", "SANDERS", "CLINTON", "SANDERS", "SANDERS...
$ word <chr> "campaign", "results", "endorse", "california", "...
$ counts <int> 2, 1, 1, 1, 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1...
Looking at the above output, you can see that one of our variables is named speaker_type, which looks to be one of two character strings: moderator or candidate. Let's look more closely at this.
master_counts %>%
select(speaker_type) %>% # select just this 1 variable
distinct() # yup, just one of 2 things# A tibble: 2 × 1
speaker_type
<chr>
1 moderator
2 candidate
master_counts %>%
filter(speaker_type == "candidate") %>% # cands only
distinct(speaker) %>% # unique names only
arrange(speaker) # put list in alphabetical order# A tibble: 31 × 1
speaker
<chr>
1 AUDIENCE
2 BROWNLEE
3 BUSH
4 CARSON
5 CHAFEE
6 CHRISTIE
7 CLINTON
8 CRAMER
9 CRUZ
10 EPPERSON
# ... with 21 more rows
So we confirmed that we have moderators and candidates as speakers, but woops! We see we have an 'AUDIENCE' candidate in the democratic party (who booed x2), and a lot of other unusual sounding speakers who are classified as candidates. Lesson of the day: always perform sanity checks with #otherpeoplesdata. Here is our plan:
cand_counts.stringr package to calculate str_length, then use dplyr::filter to exclude rows where length = 1.real_cands <- c("CLINTON", "SANDERS", "CHAFEE", "WEBB", "OMALLEY", "BUSH", "CARSON",
"CHRISTIE", "CRUZ", "HUCKABEE", "KASICH", "FIORINA", "RUBIO", "PAUL", "TRUMP",
"SANTORUM")cand_counts <- master_counts %>%
filter(speaker_type == "candidate" & # candidates only
(speaker %in% real_cands) & # but real candidates only
!(date =="06 Aug 2015") & # forget about this debate for now
!(str_length(word) <= 1)) # exclude short wordsIf you look in your global environment now, you can see that my cand_counts dataframe still has 6 variables, just like my master_counts dataframe, but now I only have 43213 as opposed to 63763 rows or observations in my dataframe.
We'll use the quanteda package's English stopwords list to remove words that we likely don't care about.
quanteda::stopwords("english") # see list of stopwords
`?`(stopwords # read note of caution
)As it states in the quanteda note of caution, stops words are an arbitrary choice and your mileage may vary here. Now we'll use our second verb for dataframes, dplyr::filter, to keep only those words stored in the word variable that are not in this list of stopwords. In R, ! is the logical operator that means "NOT."
cand_counts <- cand_counts %>% filter(!(word %in% stopwords("english"))) # keeps only rows with words not in listThe date of each debate is currently a character variable- we need to change it to a date! We'll use lubridate to make R recognize that a string like 04 Feb 2016 has specific information stored in it (as opposed to fruit salad, for example). Namely, the string 04 Feb 2016 tells me 3 important variables: day, month, and year, and in that order. Recognizing that information will allow us to sort by so that the debates are shown in chronological order when we start plotting. You could also calculate number of days/months/years between two debates. We'll let R know that we are inputting a string that is formatted day-month-year, so lubridate::dmy() is the right command for this job.
library(lubridate)
dmy("04 Feb 2016") # test this out- does it work?cand_counts <- cand_counts %>% mutate(date_time = dmy(date))
glimpse(cand_counts) # what changed?Observations: 41,609
Variables: 7
$ date <chr> "04 Feb 2016", "04 Feb 2016", "04 Feb 2016", "04 ...
$ party <chr> "democrats", "democrats", "democrats", "democrats...
$ speaker_type <chr> "candidate", "candidate", "candidate", "candidate...
$ speaker <chr> "SANDERS", "CLINTON", "SANDERS", "SANDERS", "CLIN...
$ word <chr> "results", "endorse", "california", "hat", "naval...
$ counts <int> 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 1, 1, 8, 2, 1, 1, 1...
$ date_time <date> 2016-02-04, 2016-02-04, 2016-02-04, 2016-02-04, ...
Now, let's calculate some new variables to describe the individual words spoken by the candidates: the length and number of syllables for each word. We'll use the stringr package, which gives us verbs for dealing with variables that contain character strings. We'll use it to strip the punctuation from each word (so "i'll" becomes "ill") then count the number of letters. Play around with these commands to get a feel for how the functions work (bonus points if you can identify who said the fruit salad line in the debates)!
line <- "The fruit salad of their life is what I will look at"
str_length(line) # from stringr, number of characters (including spaces)
syllables(line) # from quanteda, number of syllables
str_length("fruit")
syllables("fruit")We'll add these new variables using dplyr::mutate to our cand_counts dataframe.
cand_counts <- cand_counts %>%
mutate(word_nop = str_replace_all(word, "[[:punct:]]",""), # strip punctuation
word_length = str_length(word_nop), # stringr package
syl_count = syllables(word)) %>% # quanteda package
select(-word_nop) # drop that, we won't use it again
head(cand_counts)# A tibble: 6 × 9
date party speaker_type speaker word counts date_time
<chr> <chr> <chr> <chr> <chr> <int> <date>
1 04 Feb 2016 democrats candidate SANDERS results 1 2016-02-04
2 04 Feb 2016 democrats candidate CLINTON endorse 1 2016-02-04
3 04 Feb 2016 democrats candidate SANDERS california 1 2016-02-04
4 04 Feb 2016 democrats candidate SANDERS hat 1 2016-02-04
5 04 Feb 2016 democrats candidate CLINTON naval 1 2016-02-04
6 04 Feb 2016 democrats candidate CLINTON another 3 2016-02-04
# ... with 2 more variables: word_length <int>, syl_count <int>
Both our new variables, word_length and syl_count should be integers. Is that right? How can you tell?
An important thing to think about when starting out with any new dataset is to ask yourself: what is the unit of measurement? Usually the answer to this question starts with "I have one-row-per-___". Take a close look at our cand_counts dataframe. We have one-row-per-debate/speaker/word combination. For example, my first row currently looks like this:
head(cand_counts, n = 1)# A tibble: 1 × 9
date party speaker_type speaker word counts date_time
<chr> <chr> <chr> <chr> <chr> <int> <date>
1 04 Feb 2016 democrats candidate SANDERS results 1 2016-02-04
# ... with 2 more variables: word_length <int>, syl_count <int>
We know that certainly there are other rows with that same date, and that there are other rows with that same date where the speaker is also SANDERS. But we know that the count of 1 there tells us that Bernie said the word results during that February 4 debate just once.
On the other hand, given data in this one-row-per-debate/speaker/word dataframe, we can wrangle our data into different units for analysis, as long as those units can be calculated from what was measured. There are several possible units of measurement we can get to from here:
There is some important debate-level information we'll want to use for our EDA. These include:
Let's start with (1) and (2). We'll do these two counting tasks at the same time using more dplyr.
debates <- cand_counts %>%
group_by(date_time, party) %>%
summarise(tot_speakers = n_distinct(speaker), # count unique speakers
tot_tokens = sum(counts)) # sum word counts
debatesSource: local data frame [12 x 4]
Groups: date_time [?]
date_time party tot_speakers tot_tokens
<date> <chr> <int> <int>
1 2015-09-16 republicans 10 11382
2 2015-10-13 democrats 4 6405
3 2015-10-28 republicans 10 7220
4 2015-11-10 republicans 8 7218
5 2015-11-14 democrats 2 4197
6 2015-12-15 republicans 9 8705
7 2015-12-19 democrats 2 5419
8 2016-01-14 republicans 7 8341
9 2016-01-17 democrats 2 4100
10 2016-01-28 republicans 7 6212
11 2016-02-04 democrats 2 5939
12 2016-02-06 republicans 7 8536
You can see now that we have one-row-per-debate (which, since the debates for the two different parties happen on different dates, means that we automatically also have one-row-per-debate/party combination). We can do a quick plot here to show the relationship between total words spoken and total speakers per debate. I'll do this by creating a new object, sp (for speaker plot), and add (+) each layer one at a time.
sp <- ggplot(debates, aes(x = date_time, y = tot_speakers, colour = party, group = party))
sp <- sp + geom_point() # put dots at the x/y values
sp <- sp + geom_line() # connect the dots
sp <- sp + scale_x_date(name = "") # x-axis is a time variable
sp <- sp + scale_color_manual(values = c("dodgerblue", "firebrick"))
sp <- sp + scale_y_continuous(name = "number of candidates", breaks = seq(0,
10, 2))
sp <- sp + coord_cartesian(ylim = c(0, 10))
spThis plot is a good sanity check- we know that the total number of speakers has gone down with (almost) each debate in the GOP, but not so much in the democratic ones since there have been 2 candidates for 4 debates now. So now onto (3): we want to make a variable for the order of the debates, separately for each party. We'll use dplyr again to add our new variable, debate_num.
debates <- debates %>%
ungroup() %>%
arrange(party, date_time) %>% # dems first, GOP second
group_by(party) %>% # grouped by party
mutate(debate_num = seq_along(date_time)) # count unique values in order
debates # did it work?Source: local data frame [12 x 5]
Groups: party [2]
date_time party tot_speakers tot_tokens debate_num
<date> <chr> <int> <int> <int>
1 2015-10-13 democrats 4 6405 1
2 2015-11-14 democrats 2 4197 2
3 2015-12-19 democrats 2 5419 3
4 2016-01-17 democrats 2 4100 4
5 2016-02-04 democrats 2 5939 5
6 2015-09-16 republicans 10 11382 1
7 2015-10-28 republicans 10 7220 2
8 2015-11-10 republicans 8 7218 3
9 2015-12-15 republicans 9 8705 4
10 2016-01-14 republicans 7 8341 5
11 2016-01-28 republicans 7 6212 6
12 2016-02-06 republicans 7 8536 7
Next, we need to combine the per-debate data we made (debate_data) with our per-word data (cand_counts). We'll call this new data frame debates.
debates <- debates %>%
ungroup() %>% # this was grouped before, need to un-do
left_join(cand_counts) %>%
mutate(speaker = as.factor(speaker),
party = as.factor(party),
debate_num = as.factor(debate_num)) # useful for later plottingWhat can we do with this new data? This is actually a little tricky, because remember that we have this separate counts variable, which we can't ignore. We'll need to figure out how to weight each of these variables by its count, which won't make much sense at the word level, but will make for a more accurate estimate of average length and number of syllables for each candidate. So let's stop thinking about per-word data, and start thinking about the data we have for each candidate.
Now we are ready for prime time! We need to calculate summary statistics at the candidate level. But because we are interested in looking at the candidates' performances across debates, we'll calculate these summary statistics for each candidate at each debate. Here are 6 variables we need:
share_tokens): the proportion of words spoken during the debate by each candidate. This variable uses a per-debate variable, total words spoken (tot_tokens) as the denominator.ttr): the ratio of unique words (types) divided by total words (tokens) spoken by each candidateavg_length): mean word length used by each candidate. This variable uses a per-word variable, length of each word (word_length), weighted by counts as the numerator.avg_syl): mean number of syllables per word used by each candidate. This variable uses a per-word variable, syllable count per word (syl_count), weighted by counts as the numerator.cand_by_debate <- debates %>%
group_by(party, speaker, debate_num, date_time, tot_tokens, tot_speakers) %>%
summarise(tokens = sum(counts), #all words spoken
types = n_distinct(word), #all unique words spoken
num_length = sum(word_length*counts), #numerator in mutate later
num_syl = sum(syl_count*counts)) %>% #numerator in mutate later
mutate(share_tokens = tokens/tot_tokens,
ttr = types/tokens,
avg_length = num_length/tokens,
avg_syl = num_syl/tokens) %>%
select(-num_length, -num_syl) %>% # don't need these anymore
ungroup() # don't need grouping anymore
glimpse(cand_by_debate)Observations: 70
Variables: 12
$ party <fctr> democrats, democrats, democrats, democrats, demo...
$ speaker <fctr> CHAFEE, CLINTON, CLINTON, CLINTON, CLINTON, CLIN...
$ debate_num <fctr> 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 1, 2, 3, 4, ...
$ date_time <date> 2015-10-13, 2015-10-13, 2015-11-14, 2015-12-19, ...
$ tot_tokens <int> 6405, 6405, 4197, 5419, 4100, 5939, 6405, 4197, 5...
$ tot_speakers <int> 4, 4, 2, 2, 2, 2, 4, 2, 2, 2, 2, 4, 10, 10, 8, 9,...
$ tokens <int> 762, 2305, 2192, 2921, 1952, 3006, 2103, 2005, 24...
$ types <int> 411, 1018, 1073, 1223, 955, 1204, 881, 866, 1016,...
$ share_tokens <dbl> 0.11896956, 0.35987510, 0.52227782, 0.53902934, 0...
$ ttr <dbl> 0.5393701, 0.4416486, 0.4895073, 0.4186922, 0.489...
$ avg_length <dbl> 6.093176, 6.090239, 6.150091, 6.078740, 6.073770,...
$ avg_syl <dbl> 1.937008, 1.934924, 1.946168, 1.958233, 1.918545,...
Now, length and the number of syllables for each word are represented here by means- but there is another way. We could instead count the number of words spoken by each candidate at each word length.
get_length_counts <- cand_counts %>%
group_by(party, speaker, word_length) %>%
summarise(length_count = sum(counts)) %>% # counting words by length
ungroup() %>%
group_by(speaker) %>%
mutate(tokens = sum(length_count), # counting words across lengths
by_length_prop = length_count/tokens,
length_cat = cut(word_length, breaks = c(1, 2, 3, 4, 5, 25),
labels = c("2", "3", "4", "5", "6+")),
max_count = ifelse(by_length_prop == max(by_length_prop), 1, 0))# do these proportions sum to 1?
get_length_counts %>% group_by(speaker) %>% summarise(sum_check = sum(by_length_prop))We'll start with a word length plot, which I'll make as an object called wlp.
# stacked bar chart
mygop <- c("#0072B2", "#009E73", "#D55E00", "#CC79A7", "#F0E442", "#56B4E9")
wlp <- get_length_counts %>% ggplot(aes(x = speaker, y = by_length_prop, fill = factor(length_cat)))
wlp <- wlp + geom_bar(stat = "identity", alpha = 0.8)
wlp <- wlp + scale_fill_manual(values = mygop, name = "word length")
wlp <- wlp + scale_y_continuous(expand = c(0, 0))
wlp <- wlp + theme_fivethirtyeight()
wlp <- wlp + theme(axis.text.x = element_text(angle = 45, hjust = 1))
wlp# color by word length, speaker on x, position dodge
mygop <- c("#0072B2", "#009E73", "#D55E00", "#CC79A7", "#F0E442", "#56B4E9")
gg_length <- ggplot(get_length_counts, aes(x = speaker, y = by_length_prop,
fill = factor(length_cat)))
gg_length <- gg_length + geom_bar(stat = "identity", position = "dodge", alpha = 0.8)
gg_length <- gg_length + scale_fill_manual(values = mygop, name = "word length")
gg_length <- gg_length + scale_y_continuous(expand = c(0, 0))
gg_length <- gg_length + theme_fivethirtyeight()
gg_length <- gg_length + theme(axis.text.x = element_text(angle = 45, hjust = 1))
gg_length <- gg_length + theme(panel.grid.major.x = element_blank())
gg_length# color by word length, speaker on x, position dodge
mygop <- c("#0072B2", "#009E73", "#D55E00", "#CC79A7", "#F0E442", "#56B4E9")
gg_length <- ggplot(get_length_counts, aes(x = length_cat, y = by_length_prop))
gg_length <- gg_length + geom_bar(stat = "identity", alpha = 0.8)
gg_length <- gg_length + facet_grid(speaker ~ .)
gg_length <- gg_length + scale_fill_manual(values = mygop, name = "word length")
gg_length <- gg_length + scale_y_continuous(expand = c(0, 0))
gg_length <- gg_length + theme_fivethirtyeight()
gg_length <- gg_length + theme(panel.grid.major.x = element_blank())
gg_length# color by word length, speaker on x, facet by length_cat
mygop <- c("#0072B2", "#009E73", "#D55E00", "#CC79A7", "#F0E442", "#56B4E9")
gg_length <- ggplot(get_length_counts, aes(y = by_length_prop, fill = factor(length_cat)))
gg_length <- gg_length + geom_bar(stat = "identity", position = "dodge", alpha = 0.8)
gg_length <- gg_length + facet_grid(length_cat ~ speaker)
gg_length <- gg_length + scale_fill_manual(values = mygop, name = "word length")
gg_length <- gg_length + scale_y_continuous(expand = c(0, 0))
gg_length <- gg_length + theme_fivethirtyeight()
gg_length <- gg_length + theme(panel.grid.major.x = element_blank())
gg_lengthLet's try to visualize this data.
lolli_plot <- ggplot(get_length_counts, aes(x = word_length, y = by_length_prop,
colour = factor(max_count))) + geom_segment(aes(xend = word_length, y = 0,
yend = by_length_prop), alpha = 0.7, lwd = 1) + geom_point(size = 2) + scale_colour_manual(values = c("firebrick",
"dodgerblue"), guide = FALSE) + scale_x_discrete(breaks = seq_along(get_length_counts$word_length),
name = "word length") + scale_y_continuous(name = "proportion of words") +
facet_wrap(~speaker)
lolli_plot %+% subset(get_length_counts, party == "republicans")Let's start just by comparing the two parties.
ggplot(cand_by_debate, aes(x = party, y = avg_length)) + geom_boxplot()ggplot(cand_by_debate, aes(x = party, y = avg_syl, group = speaker, fill = speaker)) +
geom_boxplot()ggplot(cand_by_debate, aes(x = speaker, y = avg_syl, fill = speaker)) + geom_boxplot() +
facet_wrap(~party, scales = "free_x")These are some seriously ugly plots. Here's a riddle: when is a boxplot not a box? Answer: when there is no variability, our box turns into a horizontal line, which you can see is the case for at least 3 candidates. Since I am plotting data per candidate per debate, this means that those candidates only have data for one debate. There are also a lot of candidates- let's winnow the field.
Let's look at the candidates who we know are still in as of 2016-02-27. First, let's create a variable that tells us who is in and who is out.
gop_in <- c("RUBIO", "BUSH", "CRUZ", "TRUMP", "CARSON", "KASICH")
dem_in <- c("SANDERS", "CLINTON")
cand_by_debate <- cand_by_debate %>% mutate(still_in = factor(ifelse(speaker %in%
c(gop_in, dem_in), 1, 0))) # 1 is inWhich words are popular in these debates? What are our candidates talking the most about? For these analyses, we need to use our cand_counts dataframe. Do a dplyr::glimpse to remind yourself what is going on in there.
cand_counts %>% group_by(word) %>% summarise(tot_count = sum(counts)) %>% top_n(10,
tot_count) %>% arrange(desc(tot_count))# A tibble: 10 × 2
word tot_count
<chr> <int>
1 people 1249
2 know 821
3 think 817
4 going 815
5 country 698
6 need 667
7 get 610
8 well 572
9 president 543
10 one 530
OK, people, huh? Let's make a wordcloud using the wordcloud package, because people like making those with text data! We'll use RColorBrewer for pretty colors.
word_corpus <- cand_counts %>% group_by(word) %>% summarise(corpus_count = sum(counts))
set.seed(123)
wordcloud(word_corpus$word, word_corpus$corpus_count, scale = c(5, 0.1), max.words = 50,
random.order = FALSE, colors = brewer.pal(8, "PuBuGn")[-(1:4)], rot.per = 0.25,
use.r.layout = FALSE)Now this is a static wordcloud. We can also make one that is interactive using the d3wordcloud package we installed from Github earlier.
library(d3wordcloud)
d3wordcloud(word_corpus$word, word_corpus$corpus_count, tooltip = TRUE)We can also make barplots of word frequencies for each party.
gop_bar <- cand_counts %>%
filter(party == "republicans", counts > 20) %>% #filter first
ggplot(aes(x = word, y = counts)) # set up x/y axes
gop_bar <- gop_bar + geom_bar(stat = "identity", fill = "firebrick")
gop_bar <- gop_bar + theme(axis.text.x = element_text(angle = 45, hjust = 1))
gop_bardem_bar <- cand_counts %>%
filter(party == "democrats", counts > 25) %>% #filter first
ggplot(aes(x = word, y = counts)) # set up x/y axes
dem_bar <- dem_bar + geom_bar(stat = "identity", fill = "dodgerblue")
dem_bar <- dem_bar + theme(axis.text.x=element_text(angle=45, hjust=1))
dem_barStep 1: load my function for calculating log-likelihood. You'll need to update the file path here
source("log-likelihood.R")Step 2: need to tidy our data
# library(tidyr)
to_ll <- debates %>% select(speaker, word, counts) %>% group_by(word, speaker) %>%
summarise(count_sum = sum(counts)) %>% spread(speaker, count_sum) %>% select(2:15,
word = word)
to_ll[is.na(to_ll)] <- 0Step 3: run the log likelihood function! Input: to_ll dataframe
# FYI: this takes a while.
from_ll <- LL(to_ll, threshold = 25, sig.level = 10.83)Error in if (speaker.word == 0) speaker.word <- 1e-04: argument is of length zero
Step 4: take the from_ll dataframe as output, and rank by speakers, keep only the top 25, and those in our "in" groups for each party
top10_by_speaker <- from_ll %>% group_by(speaker) %>% arrange(desc(LL)) %>%
mutate(num_rank = percent_rank(LL), row_rank = row_number(desc(num_rank))) %>%
filter(LL > 0, row_rank < 11, speaker %in% c(gop_in, dem_in))Error in eval(expr, envir, enclos): object 'from_ll' not found
Step 4: plot it!
ggplot(top10_by_speaker, aes(x = speaker, y = row_rank)) + geom_label(aes(label = top10_by_speaker$word,
fill = top10_by_speaker$speaker), color = "white", fontface = "bold", size = 5) +
theme_bw() + theme(legend.position = 1, plot.title = element_text(size = 18),
axis.title.y = element_text(margin = margin(0, 10, 0, 0))) + labs(title = "Most Characteristic Words/Phrases per Person") +
xlab("") + ylab("Ranking") + scale_fill_manual(values = beyonce_palette(74)) +
annotation_custom(cartman, xmin = 0.5, xmax = 1.5, ymin = 0, ymax = -4) +
annotation_custom(stan, xmin = 1.5, xmax = 2.5, ymin = 0, ymax = -4) + annotation_custom(kyle,
xmin = 2.5, xmax = 3.5, ymin = 0, ymax = -4) + annotation_custom(kenny,
xmin = 3.5, xmax = 4.5, ymin = 0, ymax = -4) + annotation_custom(butters,
xmin = 4.5, xmax = 5.5, ymin = 0, ymax = -4) + annotation_custom(randy,
xmin = 5.5, xmax = 6.5, ymin = 0, ymax = -4)Error in ggplot(top10_by_speaker, aes(x = speaker, y = row_rank)): object 'top10_by_speaker' not found
boxplots of average word length per GOP candidate across debates
cand_counts <- cand_counts %>% select(counts, everything())
expand_counts <- cand_counts[rep(row.names(cand_counts), cand_counts$counts),
2:9]gop_word_length <- ggplot(data = subset(in_cands, party == "republicans"), aes(x = reorder(speaker,
avg_length), y = avg_length, colour = speaker)) + geom_boxplot(show.legend = FALSE,
outlier.size = 3) + scale_colour_manual(values = beyonce_palette(74)) +
scale_x_discrete(name = "") + scale_y_continuous(name = "average word length") +
coord_flip()
gop_word_lengthlet's add a layer of points on top
gop_word_length + geom_point(size = 3, show.legend = FALSE)hard to see overlapping points though, add jitter and alpha for transparency
gop_word_length + geom_jitter(size = 3, alpha = 0.5, show.legend = FALSE, width = 0.3,
height = 0)note: Trump is missing in debate 5
of total words spoken, what percent were 2-3-4-5+ syllable words
ggplot(cand_counts, aes(x = syl_count, y = counts)) + geom_bar(stat = "identity") +
facet_wrap(~speaker)Aesthetics I need: color/group by speaker; ttr on y-axis; debate number on x-axis The geom I want: lines
# install.packages('ggthemes') library(ggthemes)
ttr_plot <- ggplot(in_cands, aes(x = debate_num, y = ttr, color = speaker, group = speaker)) +
geom_line(lwd = 2) + geom_point(size = 3) + scale_colour_hc(name = "candidate") +
scale_x_discrete(name = "debate number") + scale_y_continuous(name = "type-token ratio")
ttr_plotsome extra fussing to add text labels and points, and themes, and change to date_time rather than debate number
# need to rotate x-axis and expand x axis
library(ggrepel)
ggplot(in_cands, aes(x = factor(date_time), y = ttr, color = speaker, group = speaker)) +
geom_line(lwd = 1.5) + geom_point(size = 2.5) + scale_colour_hc(guide = FALSE) +
scale_x_discrete(name = "debate number") + scale_y_continuous(name = "type-token ratio") +
geom_text_repel(data = in_cands[in_cands$debate_num == 1, ], aes(label = speaker),
size = 4, show.legend = F) + theme_fivethirtyeight()or add a geom_jitter
ggplot(in_cands, aes(x = speaker, y = ttr, color = speaker, group = speaker)) +
geom_jitter(size = 2.5, width = 0.5, height = 0, show.legend = F) + scale_colour_hc() +
scale_x_discrete(name = "debate number") + facet_wrap(~party, scales = "free_x") +
ggtitle("Per-debate type-token ratios by party and candidate") + theme_fivethirtyeight()cand_ts <- with(in_cands, xts(ttr, date_time))
dygraph(cand_ts)let's see how many words each candidate used that referenced key democrats
dems <- c("obama", "hillary", "bernie")
cand_counts %>% group_by(speaker) %>% filter(word %in% dems, party == "republicans")Source: local data frame [93 x 9]
Groups: speaker [10]
counts date party speaker_type speaker word date_time
<int> <chr> <chr> <chr> <chr> <chr> <date>
1 2 06 Feb 2016 republicans candidate TRUMP hillary 2016-02-06
2 10 06 Feb 2016 republicans candidate CRUZ obama 2016-02-06
3 2 06 Feb 2016 republicans candidate CRUZ hillary 2016-02-06
4 1 06 Feb 2016 republicans candidate CHRISTIE hillary 2016-02-06
5 7 06 Feb 2016 republicans candidate RUBIO hillary 2016-02-06
6 3 06 Feb 2016 republicans candidate CHRISTIE obama 2016-02-06
7 1 06 Feb 2016 republicans candidate BUSH hillary 2016-02-06
8 1 06 Feb 2016 republicans candidate CRUZ bernie 2016-02-06
9 1 06 Feb 2016 republicans candidate CARSON hillary 2016-02-06
10 3 06 Feb 2016 republicans candidate BUSH obama 2016-02-06
# ... with 83 more rows, and 2 more variables: word_length <int>,
# syl_count <int>
but we are missing words like "obamacare"; let's try again
dem_mentions <- cand_counts %>% mutate(obama = ifelse(str_detect(word, "obama"),
counts, 0), hillary = ifelse(str_detect(word, "hillary") | str_detect(word,
"secretary"), counts, 0), bernie = ifelse(str_detect(word, "sanders"), counts,
0)) %>% gather(key = mention_who, value = mention_count, obama:bernie) %>%
group_by(party, speaker, date_time, mention_who) %>% summarise(mention_sum = sum(mention_count)) %>%
mutate(mention_who = as.factor(mention_who))
dem_mentionsSource: local data frame [210 x 5]
Groups: party, speaker, date_time [70]
party speaker date_time mention_who mention_sum
<chr> <chr> <date> <fctr> <dbl>
1 democrats CHAFEE 2015-10-13 bernie 2
2 democrats CHAFEE 2015-10-13 hillary 0
3 democrats CHAFEE 2015-10-13 obama 1
4 democrats CLINTON 2015-10-13 bernie 5
5 democrats CLINTON 2015-10-13 hillary 5
6 democrats CLINTON 2015-10-13 obama 13
7 democrats CLINTON 2015-11-14 bernie 8
8 democrats CLINTON 2015-11-14 hillary 3
9 democrats CLINTON 2015-11-14 obama 3
10 democrats CLINTON 2015-12-19 bernie 10
# ... with 200 more rows
note assumptions I made about how GOPs refer by speaker to these folks debate number on x, mentions on y
library(wesanderson) # for quirky colors
ggplot(data = subset(dem_mentions, party == "republicans"), aes(x = factor(date_time),
y = mention_sum, colour = mention_who, group = mention_who)) + geom_line() +
geom_point(size = 3) + facet_wrap(~speaker) + scale_colour_manual(values = wes_palette("Darjeeling"),
name = "") + scale_x_discrete(name = "") + scale_y_continuous(name = "number of mentions")let's tidy this up a bit
tidy_dem_mentions <- dem_mentions %>%
right_join(cand_by_debate) %>% # dplyr merge
mutate(mention_prop = mention_sum/tokens) %>% # relative to that speakers' proportion of words
filter(speaker %in% c(gop_in, dem_in)) tidy_dem_mentionsSource: local data frame [153 x 16]
Groups: party, speaker, date_time [51]
party speaker date_time mention_who mention_sum debate_num
<chr> <chr> <date> <fctr> <dbl> <fctr>
1 democrats CLINTON 2015-10-13 bernie 5 1
2 democrats CLINTON 2015-10-13 hillary 5 1
3 democrats CLINTON 2015-10-13 obama 13 1
4 democrats CLINTON 2015-11-14 bernie 8 2
5 democrats CLINTON 2015-11-14 hillary 3 2
6 democrats CLINTON 2015-11-14 obama 3 2
7 democrats CLINTON 2015-12-19 bernie 10 3
8 democrats CLINTON 2015-12-19 hillary 2 3
9 democrats CLINTON 2015-12-19 obama 4 3
10 democrats CLINTON 2016-01-17 bernie 10 4
# ... with 143 more rows, and 10 more variables: tot_tokens <int>,
# tot_speakers <int>, tokens <int>, types <int>, share_tokens <dbl>,
# ttr <dbl>, avg_length <dbl>, avg_syl <dbl>, still_in <fctr>,
# mention_prop <dbl>
date on x, mentions relative to total number of words spoken on y
ggplot(data = subset(tidy_dem_mentions, party == "republicans"), aes(x = factor(date_time),
y = mention_prop, group = mention_who)) + geom_line(aes(colour = mention_who),
lwd = 2.5, alpha = 0.7) + geom_point(colour = "black", size = 1) + facet_wrap(~speaker) +
scale_colour_manual(values = wes_palette("Darjeeling"), name = "") + scale_x_discrete(name = "") +
scale_y_continuous(name = "mentions as proportion of total number of words spoken") +
theme_hc() + theme(axis.text.x = element_text(angle = 90, hjust = 1), axis.ticks = element_blank())But let's face it, for the GOP the only person worth mentioning is...
biggest_reagan_fan <- cand_counts %>% filter(party == "republicans" & speaker %in%
gop_in) %>% group_by(speaker, date_time) %>% mutate(reagan = ifelse(str_detect(word,
"reagan"), counts, 0)) %>% summarise(reagan_sum = sum(reagan))
biggest_reagan_fanSource: local data frame [41 x 3]
Groups: speaker [?]
speaker date_time reagan_sum
<chr> <date> <dbl>
1 BUSH 2015-09-16 5
2 BUSH 2015-10-28 0
3 BUSH 2015-11-10 2
4 BUSH 2015-12-15 1
5 BUSH 2016-01-14 0
6 BUSH 2016-01-28 1
7 BUSH 2016-02-06 2
8 CARSON 2015-09-16 2
9 CARSON 2015-10-28 1
10 CARSON 2015-11-10 0
# ... with 31 more rows
who is the biggest reagan fan?
annotate <- ggplot2::annotate # package tm masked this annotate command
ggplot(data = biggest_reagan_fan, aes(x = speaker, y = reagan_sum)) + geom_jitter(size = 4,
colour = "firebrick2", width = 0.6, height = 0.1, alpha = 0.8) + stat_summary(fun.y = sum,
fun.ymin = sum, fun.ymax = sum, geom = "crossbar", width = 0.6, colour = "dodgerblue") +
facet_wrap(~date_time, switch = "x", ncol = 6, scales = "free_x") + ggtitle("Which GOP candidate is the biggest Reagan fanboy?") +
theme_fivethirtyeight()mygop <- c("#0072B2", "#009E73", "#D55E00", "#CC79A7", "#F0E442", "#56B4E9")
tot_reagan <- biggest_reagan_fan %>% group_by(speaker) %>% summarise(tot_mentions = sum(reagan_sum))
gg <- ggplot(biggest_reagan_fan)
gg <- gg + geom_bar(aes(x = date_time, y = reagan_sum, fill = speaker), stat = "identity",
alpha = 0.8)
gg <- gg + scale_fill_manual(values = mygop)
gg <- gg + scale_y_continuous(expand = c(0, 0))
gg <- gg + facet_wrap(~speaker, switch = "x", ncol = 6, scales = "free_x")
gg <- gg + labs(x = NULL, y = NULL, title = "Reagan mentions")
gg <- gg + theme_fivethirtyeight()
gg <- gg + theme(panel.margin.x = unit(0.25, "cm"))
gg <- gg + theme(legend.position = "none")
gg <- gg + theme(panel.grid.major.x = element_blank())
gg